R5 Documentation

This is the documentation for the repository for the article Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions by Fabien C. Y. Benureau and Nicolas P. Rougier. This code is meant as an example of a full R^4 implementation: re-runnable, repeatable, reproducible, and reusable. A repository with the five codes presented in the article is available at github.com/rougier/random-walk (10.5281/zenodo.848221).

This code exposes two central functions: r5.walk(), which generates the walk, and r5.walk_full(), which generates the walk and returns it with full provenance data (parameters, Python version, platform, git hash):

r5.walk(n, seed=1)

Generate a random walk.

The walk is initialized at zero, and this initial state is included in the walk.

Parameters:
  • n – the number of steps of the walk.
  • seed – the seed for the random number generator. Each walk has an independent random number generator.

r5.walk_full(n, seed=1, dirty=False)

Generate a random walk, and return it with full provenance data, ready to be saved.

Parameters:
  • dirty – by default, the presence of a clean git repository is enforced during provenance data retrieval. Not enforcing it may miss important uncommitted changes that affect the computation of the result. Set dirty to True to bypass this check (not recommended).
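
As an illustration of the behavior documented above, here is a minimal sketch of what a function like r5.walk() could look like. It is not the actual r5 implementation: the step rule (a uniform choice between -1 and +1) is an assumption made for the example, since this page does not specify it.

```python
import random


def walk(n, seed=1):
    """Return a random walk of n steps, starting at (and including) zero."""
    # A private generator per call: the module-level random state is untouched.
    rng = random.Random(seed)
    path = [0]
    for _ in range(n):
        # Assumed step rule: move by -1 or +1 with equal probability.
        path.append(path[-1] + rng.choice((-1, 1)))
    return path


path = walk(10)
print(len(path), path[0])  # 11 0
```

Because the initial state is part of the walk, a walk of n steps contains n + 1 values.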

Re-runnable

The r5 code requires Python 3. The preferred installation method is to clone the existing git repository, and run the code from there. This is necessary to retrieve the git part of the provenance data:

git clone https://github.com/benureau/r5
cd r5
python setup.py develop

We use python setup.py develop rather than python setup.py install to avoid divorcing the code from the version control system. There are ways to record the git data at installation, such as the versioneer package, but we don’t use it here to keep things as simple as possible.

A test is then provided to check that the code is indeed re-runnable:

pytest r5/tests/test_rerunnable.py

An example is also included in the examples folder:

python examples/example.py

Repeatable

Random seeds for the random number generator are explicitly set with each invocation of r5.walk() (the seed is 1 if none is provided). You can explicitly verify that the code produces repeatable results with:

pytest tests/test_repeatable.py
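
To see why a per-call seeded generator matters, the sketch below contrasts a walk driven by its own generator with one driven by Python's shared module-level generator. The helper names (walk_with, seeded_walk, shared_walk, is_repeatable) and the ±1 step rule are hypothetical and are not taken from the r5 code or its test suite.

```python
import random


def walk_with(rng, n):
    """Build a walk of n steps using any object that provides choice()."""
    path = [0]
    for _ in range(n):
        path.append(path[-1] + rng.choice((-1, 1)))
    return path


def seeded_walk(n, seed=1):
    # Fresh generator seeded on every call: the result depends only on (n, seed).
    return walk_with(random.Random(seed), n)


def shared_walk(n):
    # Module-level generator: the result depends on whatever ran before.
    return walk_with(random, n)


def is_repeatable(make_walk, n=50):
    first = make_walk(n)
    random.random()  # unrelated code consuming the shared generator
    second = make_walk(n)
    return first == second


print(is_repeatable(seeded_walk))  # True
```

Passing shared_walk to is_repeatable will, in practice, report a mismatch, because each call continues from wherever the shared generator was left.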

Reproducible

The main thing that makes the code reproducible is the addition of provenance data tracking, which records the context in which the walk is computed. This provenance data contains details about the computer platform and the Python version, the packages installed and their versions, the version of the code (git SHA-1 hash), and the parameters used to generate the results.

r5.provenance(dirty=False)

Return provenance data about the execution environment.

Parameters:
  • dirty – if False, will exit with an error if the git repository is dirty or absent.

It is assumed that the code is executed in its git repository, with no uncommitted files. That makes the SHA-1 of the current commit a full description of the state of the code used to compute the results. If the repository is dirty (uncommitted changes or untracked files are present) or unavailable (for instance, if the package was installed with python setup.py install), an error is raised, and the user is informed that they must explicitly bypass the requirement of a clean git repository by setting the dirty argument to True in the r5.walk_full() function.

Such “dirty” runs of the code might be useful during development and debugging, but they should not be used to produce published results.
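
The clean-repository check described above can be sketched as follows. This is a simplified illustration, not the code in provenance.py: the function names, the dictionary keys, and the cwd parameter are assumptions, and the list of installed packages is left out for brevity. Only standard git commands (git rev-parse HEAD and git status --porcelain) are used.

```python
import platform
import subprocess
import sys


def git_output(*args, cwd="."):
    """Run a git command; return its stripped output, or None if it fails."""
    try:
        result = subprocess.run(("git",) + args, cwd=cwd, check=True,
                                capture_output=True, text=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # git is not installed, or cwd is not inside a repository
    return result.stdout.strip()


def provenance_sketch(dirty=False, cwd="."):
    """Collect a small provenance record (hypothetical layout)."""
    commit = git_output("rev-parse", "HEAD", cwd=cwd)
    # --porcelain prints one line per modified or untracked file; empty if clean.
    status = git_output("status", "--porcelain", cwd=cwd)
    is_clean = commit is not None and status == ""
    if not is_clean and not dirty:
        raise RuntimeError("git repository is dirty or absent; "
                           "pass dirty=True to bypass (not recommended)")
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "git_hash": commit,
        "git_dirty": not is_clean,
    }


print(sorted(provenance_sketch(dirty=True)))
```

Recording the dirty flag alongside the hash keeps a bypassed run identifiable later, so that it is not mistaken for a result tied to a specific commit.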

To test reproducibility, the code checks that it generates the same results as previous versions of the code. If the code is purposefully changed to modify its behavior, the test data must be regenerated by executing the tests/generate_testdata.py file. Otherwise, a failing test signals an unintentional semantic change to the code. You can run the test with:

pytest tests/test_reproducible.py
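
The general pattern behind such a test, recording reference outputs once and comparing later runs against them, can be sketched as below. The file format (JSON), the test cases, and all function names are assumptions for illustration; they do not describe the actual contents of tests/generate_testdata.py or tests/test_reproducible.py.

```python
import json
import random
import tempfile
from itertools import accumulate
from pathlib import Path


def stand_in_walk(n, seed=1):
    """Hypothetical stand-in for the function under test."""
    rng = random.Random(seed)
    return list(accumulate(rng.choices((-1, 1), k=n), initial=0))


CASES = [(10, 1), (10, 2), (100, 1)]  # (n, seed) pairs, chosen arbitrarily


def generate_testdata(path, func=stand_in_walk):
    """Record the current outputs as the reference results."""
    records = [{"n": n, "seed": seed, "walk": func(n, seed)}
               for n, seed in CASES]
    Path(path).write_text(json.dumps(records))


def matches_testdata(path, func=stand_in_walk):
    """Return True if func still reproduces every recorded result."""
    records = json.loads(Path(path).read_text())
    return all(func(r["n"], r["seed"]) == r["walk"] for r in records)


def flipped_walk(n, seed=1):
    """A deliberate semantic change: every position has its sign flipped."""
    return [-x for x in stand_in_walk(n, seed)]


with tempfile.TemporaryDirectory() as tmp:
    reference = Path(tmp) / "testdata.json"
    generate_testdata(reference)
    print(matches_testdata(reference))                # True
    print(matches_testdata(reference, flipped_walk))  # False
```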

Furthermore, the code is hosted via the Zenodo platform with a DOI, ensuring it remains reachable and available for the foreseeable future.

Reusable

This repository was designed to serve as a solid foundation for starting new projects, with all the R^4 batteries included. Care has been taken to keep the code simple, well documented, and easy to install and use.

A setup.py file is provided, along with a requirement.txt file that lists the dependencies one should install. Examples and tests are included, and the latter are automatically evaluated for each commit pushed to the GitHub repository using the Travis continuous integration service.

You can, prior to committing the files, run those tests efficiently with:

pytest

at the root of the repository. You can further examine how completely the tests cover the code by running:

pytest --cov

The coverage is full for the walker.py file where the core code resides; it is partial for provenance.py, as some cases of the git repository being dirty or absent are not tested.

This code is certainly not perfect and, even if it were, it would not remain so as the rest of the software stack and computer technology shift around it. Tools such as containers may help long-term re-runnability and provenance tracking once the technology matures and takes into account the specific constraints of long-term scientific reproducibility.

Do not hesitate to copy or fork this code, or to take just bits of it, to make your current or next projects (hopefully) better. If you stumble on a problem, open an issue on GitHub.